Sequence Modelling with Deep Learning for Visual Content Generation and Understanding
University of Technology Sydney, Faculty of Engineering and Information Technology.
Although convolutional neural networks have proven effective and stable for image feature learning, sequence modelling remains critical for learning spatial and temporal context. In an image, different semantic structures can be regarded as a sequence arranged along the horizontal (or vertical) direction; in a video, temporal sequence modelling is necessary for understanding inter-frame relationships such as object movement and occlusion. This thesis explores more effective spatial and temporal sequence modelling for image and video understanding. For the former, an encoder-decoder framework is proposed that splits an input scene into a sequence of spatial features and reconstructs the input. By modelling spatial sequence information, the framework can even predict new scenes of very large spatial extent while keeping a style consistent with the given input. For video understanding, the thesis processes temporal sequences in a recurrent manner (i.e., frame by frame), which is more memory-efficient. In addition, the thesis proposes to implicitly impose contrast between the feature embedding of each target and the surrounding background throughout the temporal sequence, improving downstream task results accordingly. Furthermore, a novel transformation module is designed to model channel relationships to improve intra-frame representation ability. To validate the proposed approaches and components, extensive experiments are conducted on image outpainting, instance segmentation, object detection, classification, video classification, and video object segmentation.
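The contrastive constraint described above (pulling each target's embedding together across frames while pushing it away from the background) can be illustrated with a short sketch. This is a hypothetical PyTorch rendering under assumed shapes, not the thesis implementation; all names are illustrative.

```python
# Hypothetical sketch: target-vs-background contrast across a temporal sequence.
import torch
import torch.nn.functional as F

def target_background_contrastive(target_feats, background_feats, temperature=0.1):
    """target_feats: (T, D) per-frame target embeddings.
    background_feats: (T, D) per-frame background embeddings."""
    target_feats = F.normalize(target_feats, dim=-1)
    background_feats = F.normalize(background_feats, dim=-1)
    # Positive pairs: the same target in consecutive frames.
    pos = (target_feats[:-1] * target_feats[1:]).sum(-1) / temperature     # (T-1,)
    # Negative pairs: target vs. background within each frame.
    neg = (target_feats[:-1] * background_feats[:-1]).sum(-1) / temperature
    logits = torch.stack([pos, neg], dim=-1)                               # (T-1, 2)
    labels = torch.zeros(len(pos), dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, labels)
```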
Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation
Audio-driven talking-head synthesis is a popular research topic for virtual
human-related applications. However, the inflexibility and inefficiency of
existing methods, which necessitate expensive end-to-end training to transfer
emotions from guidance videos to talking-head predictions, are significant
limitations. In this work, we propose the Emotional Adaptation for Audio-driven
Talking-head (EAT) method, which transforms emotion-agnostic talking-head
models into emotion-controllable ones in a cost-effective and efficient manner
through parameter-efficient adaptations. Our approach utilizes a pretrained
emotion-agnostic talking-head transformer and introduces three lightweight
adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and
Emotional Adaptation Module) from different perspectives to enable precise and
realistic emotion controls. Our experiments demonstrate that our approach
achieves state-of-the-art performance on widely-used benchmarks, including LRW
and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable
generalization ability, even in scenarios where emotional training videos are
scarce or nonexistent. Project website: https://yuangan.github.io/eat/
Comment: Accepted to ICCV 2023.
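The parameter-efficient recipe described here (a frozen emotion-agnostic backbone plus small trainable emotional modules) can be sketched as follows. This is a hypothetical PyTorch illustration of prompt-style adaptation in the spirit of the Deep Emotional Prompts; the class, shapes, and emotion vocabulary are assumptions, not the EAT code.

```python
# Hypothetical sketch: freeze the backbone, train only per-emotion prompt tokens.
import torch
import torch.nn as nn

class EmotionalPromptAdapter(nn.Module):
    def __init__(self, backbone: nn.Module, d_model=512,
                 num_emotions=8, prompts_per_emotion=4):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # backbone stays frozen
            p.requires_grad = False
        # Only these prompt embeddings are trained.
        self.prompts = nn.Parameter(
            torch.randn(num_emotions, prompts_per_emotion, d_model) * 0.02)

    def forward(self, tokens, emotion_id):
        # tokens: (B, N, d_model); prepend the chosen emotion's prompts.
        B = tokens.size(0)
        prompt = self.prompts[emotion_id].unsqueeze(0).expand(B, -1, -1)
        return self.backbone(torch.cat([prompt, tokens], dim=1))

# Usage (illustrative):
# layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# model = EmotionalPromptAdapter(nn.TransformerEncoder(layer, num_layers=4))
# out = model(torch.randn(2, 16, 512), emotion_id=3)   # (2, 4 + 16, 512)
```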
ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking
The Associating Objects with Transformers (AOT) framework has exhibited
exceptional performance in a wide range of complex scenarios for video object
tracking and segmentation. In this study, we convert the bounding boxes to
masks in reference frames with the help of the Segment Anything Model (SAM) and
Alpha-Refine, and then propagate the masks to the current frame, transforming
the task from Video Object Tracking (VOT) to video object segmentation (VOS).
Furthermore, we introduce MSDeAOT, a variant of the AOT series that
incorporates transformers at multiple feature scales. MSDeAOT efficiently
propagates object masks from previous frames to the current frame using two
feature scales with strides of 16 and 8. As a testament to the effectiveness of
our design, we achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
Comment: Top 1 solution for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking. arXiv admin note: text overlap with arXiv:2307.0201
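The box-to-mask conversion step can be sketched with the public segment-anything API (the Alpha-Refine refinement is omitted here); the propagation loop is a hypothetical placeholder, since the MSDeAOT tracker interface is not part of this abstract.

```python
# Convert a reference bounding box to a mask with SAM, then (hypothetically)
# propagate it frame by frame, turning VOT into VOS.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def box_to_mask(predictor: SamPredictor, frame: np.ndarray,
                box_xyxy: np.ndarray) -> np.ndarray:
    """frame: HxWx3 uint8 RGB; box_xyxy: [x0, y0, x1, y1]."""
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
    return masks[0]  # boolean HxW mask

# Usage (checkpoint path is an assumption):
# sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
# mask = box_to_mask(SamPredictor(sam), frames[0], np.array([x0, y0, x1, y1]))
# for frame in frames[1:]:                        # hypothetical AOT-style loop
#     mask = aot_tracker.propagate(frame, mask)   # placeholder, not a real API
```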
Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction
Reconstructing 3D clothed human avatars from single images is a challenging
task, especially when encountering complex poses and loose clothing. Current
methods exhibit limitations in performance, largely attributable to their
dependence on insufficient 2D image features and inconsistent query methods.
To address this, we present the Global-correlated 3D-decoupling Transformer for
clothed Avatar reconstruction (GTA), a novel transformer-based architecture
that reconstructs clothed human avatars from monocular images. Our approach
leverages transformer architectures by utilizing a Vision Transformer model as
an encoder for capturing global-correlated image features. Subsequently, our
innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane
features, using learnable embeddings as queries for cross-plane generation. To
effectively enhance feature fusion with the tri-plane 3D feature and human body
prior, we propose a hybrid prior fusion strategy combining spatial and
prior-enhanced queries, leveraging the benefits of spatial localization and
human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0
datasets illustrate that our method outperforms state-of-the-art approaches in
both geometry and texture reconstruction, exhibiting high robustness to
challenging poses and loose clothing, and producing higher-resolution textures.
Code will be available at https://github.com/River-Zhang/GTA.
Comment: Accepted by NeurIPS 2023. Project page: https://river-zhang.github.io/GTA-projectpage
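The decoding idea above (learnable embeddings as queries that cross-attend to global image features, producing decoupled tri-plane features) can be sketched as below. A hypothetical PyTorch illustration; shapes and names are assumptions, not the GTA code.

```python
# Hypothetical sketch: three learnable query sets, one per plane, cross-attend
# to global-correlated encoder tokens to produce decoupled plane features.
import torch
import torch.nn as nn

class TriPlaneDecoupler(nn.Module):
    def __init__(self, d_model=768, nhead=8, tokens_per_plane=256):
        super().__init__()
        # One learnable query set per plane (xy, xz, yz).
        self.queries = nn.Parameter(torch.randn(3, tokens_per_plane, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, image_tokens):
        # image_tokens: (B, N, d_model), e.g. from a Vision Transformer encoder.
        B = image_tokens.size(0)
        planes = []
        for q in self.queries:  # decouple one plane at a time
            q = q.unsqueeze(0).expand(B, -1, -1)
            feat, _ = self.cross_attn(q, image_tokens, image_tokens)
            planes.append(feat)                 # (B, tokens_per_plane, d_model)
        return torch.stack(planes, dim=1)       # (B, 3, tokens_per_plane, d_model)
```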
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of
sound-producing objects within image frames and ensure the maps faithfully
adhere to the given audio, such as identifying and segmenting a singing person
in a video. However, existing methods exhibit two limitations: 1) they address
video temporal features and audio-visual interactive features separately,
disregarding the inherent spatial-temporal dependence of combined audio and
video, and 2) they inadequately introduce audio constraints and object-level
information during the decoding stage, resulting in segmentation outcomes that
fail to comply with audio directives. To tackle these issues, we propose a
decoupled audio-video transformer that combines audio and video features from
their respective temporal and spatial dimensions, capturing their combined
dependence. To optimize memory consumption, we design a block that, when
stacked, captures fine-grained audio-visual combinatorial dependence in a
memory-efficient manner. Additionally, we introduce audio-constrained
queries during the decoding phase. These queries contain rich object-level
information, ensuring the decoded mask adheres to the sounds. Experimental
results confirm our approach's effectiveness, with our framework achieving
new state-of-the-art performance on all three datasets using two backbones. The code is
available at https://github.com/aspirinone/CATR.github.io
Comment: Accepted by ACM MM 2023.
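The audio-constrained queries can be sketched as follows: object queries are modulated by an audio embedding before cross-attending to visual features, tying the decoded mask to the sound. A hypothetical PyTorch illustration; all names and shapes are assumptions, not the CATR code.

```python
# Hypothetical sketch: inject the audio embedding into object queries so the
# decoded masks follow the audio directive.
import torch
import torch.nn as nn

class AudioConstrainedDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_queries=20):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.audio_proj = nn.Linear(d_model, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, visual_tokens, audio_embed):
        # visual_tokens: (B, N, d_model); audio_embed: (B, d_model)
        B = visual_tokens.size(0)
        q = self.object_queries.unsqueeze(0).expand(B, -1, -1)
        q = q + self.audio_proj(audio_embed).unsqueeze(1)  # audio constraint
        out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return out  # per-query features, later projected to per-pixel masks
```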
AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion
Large-scale pre-trained vision-language models allow for the zero-shot
text-based generation of 3D avatars. The previous state-of-the-art method
utilized CLIP to supervise neural implicit models that reconstructed a human
body mesh. However, this approach has two limitations. Firstly, the lack of
avatar-specific models can cause facial distortion and unrealistic clothing in
the generated avatars. Secondly, CLIP only provides optimization direction for
the overall appearance, resulting in less impressive results. To address these
limitations, we propose AvatarFusion, the first framework to use a latent
diffusion model to provide pixel-level guidance for generating human-realistic
avatars while simultaneously segmenting clothing from the avatar's body.
AvatarFusion includes the first clothing-decoupled neural implicit avatar model
that employs a novel Dual Volume Rendering strategy to render the decoupled
skin and clothing sub-models in one space. We also introduce a novel
optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which
semantically separates the generation of body and clothes, and generates a
variety of clothing styles. Moreover, we establish the first benchmark for
zero-shot text-to-avatar generation. Our experimental results demonstrate that
our framework outperforms previous approaches, with significant improvements
observed in all metrics. Additionally, since our model is clothing-decoupled,
we can exchange the clothes of avatars. Code will be available on GitHub.
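One plausible reading of the Dual Volume Rendering strategy is two implicit sub-models (body and clothing) evaluated on the same ray samples with jointly composited densities. The sketch below is purely illustrative and is not the paper's formulation.

```python
# Hypothetical sketch: composite two radiance sub-fields in one shared space.
import torch

def dual_volume_render(body_sigma, body_rgb, cloth_sigma, cloth_rgb, deltas):
    """body_sigma, cloth_sigma, deltas: (rays, samples);
    body_rgb, cloth_rgb: (rays, samples, 3)."""
    sigma = body_sigma + cloth_sigma               # joint density in one space
    w_body = body_sigma / (sigma + 1e-8)           # per-sample mixing weights
    rgb = w_body[..., None] * body_rgb + (1 - w_body)[..., None] * cloth_rgb
    alpha = 1 - torch.exp(-sigma * deltas)         # standard volume rendering
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1 - alpha[:, :-1]], dim=1), dim=1)
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)   # (rays, 3) pixel colors
```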
ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation
The Associating Objects with Transformers (AOT) framework has exhibited
exceptional performance in a wide range of complex scenarios for video object
segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series
that incorporates transformers at multiple feature scales. Leveraging the
hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates
object masks from previous frames to the current frame using a feature scale
with a stride of 16. Additionally, we employ GPM in a more refined feature
scale with a stride of 8, leading to improved accuracy in detecting and
tracking small objects. Through the implementation of test-time augmentations
and model ensemble techniques, we achieve the top-ranking position in the
EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.
Comment: Top 1 solution for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation
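The test-time augmentation and ensembling step admits a standard sketch: average class probabilities over horizontal flips and over ensemble members, then take the argmax. A hypothetical PyTorch illustration; the models and their call signature are assumptions.

```python
# Hypothetical sketch: flip-based test-time augmentation plus model ensembling.
import torch

@torch.no_grad()
def tta_ensemble_masks(models, frame):
    """frame: (B, 3, H, W); each model returns per-pixel logits (B, C, H, W)."""
    probs = []
    for model in models:
        probs.append(model(frame).softmax(dim=1))
        flipped = model(torch.flip(frame, dims=[-1])).softmax(dim=1)
        probs.append(torch.flip(flipped, dims=[-1]))  # un-flip before averaging
    return torch.stack(probs).mean(dim=0).argmax(dim=1)  # (B, H, W) mask ids
```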